SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
These slides are based on Telling Stories with Data (Alexander, 2023) and Regression and Other Stories (Gelman et al., 2021).
Students are encouraged to refer to the relevant chapters for additional detail and examples.
By the end of this seminar, you will be able to:
A working definition
Data science is humans measuring things, typically related to other humans, and using sophisticated averaging to explain and predict.
This definition emphasises:
Data science allows us to:
“In any case the key elements are the same… What is the dataset? Who generated it and why? What is missing?” (Alexander, 2023, p. 3) TSwD
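To make the "sophisticated averaging" in the working definition concrete, here is a minimal sketch in base R. The data are made up (the groups, means and seed are all assumptions for illustration); the point is only that a regression with no predictors returns the overall average, and a regression on group membership returns the group averages.

```r
# Hypothetical data: weekly study hours for two made-up groups of students.
set.seed(4102)
hours <- data.frame(
  group = rep(c("undergrad", "postgrad"), each = 50),
  study = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 14, sd = 2))
)

# An intercept-only regression estimates the overall average.
overall_fit <- lm(study ~ 1, data = hours)
overall_mean <- mean(hours$study)

# Regressing on group with no intercept (0 + group) estimates each group's average.
group_fit <- lm(study ~ 0 + group, data = hours)
group_means <- tapply(hours$study, hours$group, mean)

# Both comparisons should report TRUE.
all.equal(as.numeric(coef(overall_fit)), overall_mean)
all.equal(as.numeric(coef(group_fit)), as.numeric(group_means))
```

More elaborate models can be read the same way: as averages that adjust for other variables.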
Even seemingly simple things are hard to measure:
What is the minimum we need to capture the essence?
Fundamental challenges of statistical inference
These challenges arise in nearly every application of data analysis!
The problem: We usually only observe a sample of the population we care about.
Examples:
Selection bias: The people we observe may differ systematically from those we don’t.
Who is systematically missing from our data?
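Selection bias is easy to see in a simulation. The sketch below is illustrative only: the population, the "political interest" score and the response mechanism are all assumptions, not real data. It compares a simple random sample with a sample where more interested people are more likely to respond.

```r
set.seed(6006)
# Hypothetical population: 10,000 people with a "political interest" score.
population <- data.frame(interest = rnorm(10000, mean = 50, sd = 10))

# Assumed response mechanism: more interested people are more likely to answer.
response_prob <- plogis((population$interest - 50) / 10)
responded <- rbinom(nrow(population), size = 1, prob = response_prob) == 1

# A simple random sample of 500, versus the self-selected respondents.
random_sample <- population$interest[sample(nrow(population), 500)]
self_selected <- population$interest[responded]

# Compare the three averages side by side.
c(
  population = mean(population$interest),
  random = mean(random_sample),
  self_selected = mean(self_selected)
)
```

The random sample should land close to the population average, while the self-selected group overstates it, no matter how many people respond.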
The problem: We want to know what would have happened if we had made a different choice.
In experiments:
In observational studies:
Correlation ≠ Causation
Just because two things are associated doesn’t mean one causes the other!
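One way an association can arise without causation is a shared cause (a confounder). The following sketch uses invented variables and coefficients purely for illustration: temperature drives both ice cream sales and sunburn cases, and neither affects the other.

```r
set.seed(2026)
n <- 1000

# Hypothetical confounder: daily temperature drives both outcomes.
temperature <- rnorm(n, mean = 25, sd = 5)
ice_cream_sales <- 2 * temperature + rnorm(n, sd = 5)
sunburn_cases <- 1.5 * temperature + rnorm(n, sd = 5)

# The raw association is strong even though neither variable causes the other.
raw_correlation <- cor(ice_cream_sales, sunburn_cases)

# Adjusting for the confounder: the ice cream coefficient shrinks towards zero.
adjusted_fit <- lm(sunburn_cases ~ ice_cream_sales + temperature)
adjusted_coef <- coef(adjusted_fit)[["ice_cream_sales"]]

raw_correlation
adjusted_coef
```

In observational studies we rarely know, or measure, every confounder, which is why adjustment alone does not establish causation.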
The problem: What we measure is rarely what we actually want to know.
What we measure
What we want to know
“Most of the time our data do not record exactly what we would ideally like to study.” (Gelman et al., 2021, p. 3) ROS
The Human Development Index (HDI) claims to measure “human development” using:
But: Most variation between US states comes from income, not health or education!
The lesson: Always examine where your numbers come from.
What does your measure actually capture?
Definition
A measure is valid to the degree that it represents what you are trying to measure.
Examples of validity problems:
Key question: Is there general agreement that the observations are closely related to the intended construct?
Definition
A reliable measure is one that is precise and stable—if we measure again, we get similar values.
Ways to assess reliability:
The key insight
Variability in our data should reflect real differences, not measurement error.
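One common reliability check is to measure the same units twice and correlate the results (test-retest). The sketch below simulates this under stated assumptions: the "true" scores, the noise levels and the variable names are all hypothetical.

```r
set.seed(43)
n <- 500

# Hypothetical "true" trust-in-government scores (never observed in practice).
true_score <- rnorm(n, mean = 5, sd = 1.5)

# Two administrations of a low-noise instrument and of a high-noise instrument.
precise_t1 <- true_score + rnorm(n, sd = 0.3)
precise_t2 <- true_score + rnorm(n, sd = 0.3)
noisy_t1 <- true_score + rnorm(n, sd = 2)
noisy_t2 <- true_score + rnorm(n, sd = 2)

# Test-retest correlation: closer to 1 suggests a more reliable measure.
reliability <- c(
  precise = cor(precise_t1, precise_t2),
  noisy = cor(noisy_t1, noisy_t2)
)
reliability
```

With the noisy instrument, much of the variability in the data is measurement error rather than real differences between people.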
High validity, low reliability
High reliability, low validity
We need both!
A measure can be reliable without being valid, but a valid measure must be reasonably reliable.
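A rough way to picture the two cases is to treat validity as being centred on the target and reliability as having little spread. This is a simplification, and the instruments and numbers below are invented for illustration.

```r
set.seed(47)
true_value <- 100
n <- 1000

# Hypothetical instruments measuring the same quantity n times.
valid_but_noisy <- true_value + rnorm(n, mean = 0, sd = 15)      # centred, imprecise
reliable_but_biased <- true_value + rnorm(n, mean = 20, sd = 1)  # precise, off-target

# Bias: average distance from the true value. Spread: standard deviation.
summarise_measure <- function(x) {
  c(bias = mean(x) - true_value, spread = sd(x))
}

measure_summary <- rbind(
  valid_but_noisy = summarise_measure(valid_but_noisy),
  reliable_but_biased = summarise_measure(reliable_but_biased)
)
measure_summary
```

The second instrument gives nearly the same answer every time, but it is consistently the wrong answer.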
When exploring data, remember:
“All graphical displays can be considered as comparisons.”
Effective graphs:
Plan → Simulate → Acquire → Explore → Share
This workflow guides everything we do in this course.
Why plan first?
“In Alice’s Adventures in Wonderland, Alice asks the Cheshire Cat which way she should go. The Cat replies that it depends on where Alice wants to get to.”
Planning involves:
Practical tip
Ten minutes with paper and pen is often enough to get started!
Why simulate data?
For data cleaning:
For modelling:
“Simulation is often cheap—almost free given modern computing resources—and fast.” (Alexander, 2023, p. 5) TSwD
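As a small illustration of simulating before cleaning, the sketch below builds a made-up survey, plants two impossible ages, and checks that a cleaning step catches exactly those two. The variable names, age range and `clean_age()` helper are assumptions for this example.

```r
set.seed(61)

# Hypothetical survey: valid ages lie between 18 and 100.
simulated_survey <- data.frame(
  respondent_id = 1:200,
  age = sample(18:100, size = 200, replace = TRUE)
)

# Deliberately inject the kinds of errors that real data might contain.
simulated_survey$age[c(5, 50)] <- c(-1, 999)

# A cleaning step (hypothetical helper): set impossible ages to missing.
clean_age <- function(age) {
  ifelse(age >= 18 & age <= 100, age, NA)
}
simulated_survey$age_clean <- clean_age(simulated_survey$age)

# We planted exactly two bad values, so this count should be 2.
sum(is.na(simulated_survey$age_clean))
```

Because we know what the simulated data contain, any other count would point to a bug in the cleaning code rather than a quirk of the data.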
Data acquisition is often overlooked but critical!
Key considerations:
Data never “speak for themselves”
They are shaped by the choices of those who collected and prepared them.
Exploratory Data Analysis (EDA) involves:
This is an iterative process that continues throughout your project.
“It is difficult to delineate where EDA ends and formal statistical modelling begins.”
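A first pass at EDA might look something like the following. It is only a sketch, and it uses `swiss`, a dataset that ships with R, as a stand-in for your own data; which variables to look at is a judgment call.

```r
# `swiss` ships with R: fertility and socio-economic indicators
# for 47 French-speaking Swiss provinces, around 1888.
data(swiss)

str(swiss)                 # variable names, types and dimensions
summary(swiss$Education)   # distribution of a single variable

# Where are values missing? One count per column.
missing_per_column <- colSums(is.na(swiss))
missing_per_column

# A first comparison: is education associated with fertility?
education_fertility_cor <- cor(swiss$Education, swiss$Fertility)
education_fertility_cor
```

Each answer usually prompts a new question, which is why EDA keeps looping back on itself.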
Communication is the most important element
Simple analysis, communicated well, is more valuable than complicated analysis communicated poorly.
Clear communication means:
Advantages:
For this course:
For beginners, we recommend starting with Posit Cloud:
Why Posit Cloud?
Four main panes:
Key shortcuts:
Ctrl/Cmd + Enter: Run current line
Ctrl/Cmd + Shift + Enter: Run chunk
Tab: Autocomplete
Ctrl/Cmd + S: Save
Packages extend R’s functionality.
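Packages are installed once and then loaded in every session that uses them. The sketch below assumes `tidyverse` as the example package; substitute whichever package you need.

```r
# Install once per computer or Posit Cloud project (needs internet access):
# install.packages("tidyverse")

# Check whether the package is available before trying to load it.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  library(tidyverse)  # load at the top of each script or Quarto document
} else {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") first.")
}
```

The install line is commented out on purpose: installing belongs in the console, not in a script that is run repeatedly.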
Key packages for this course
Quarto combines text and code for reproducible research.
Key elements:
Code chunks contain R code that will be executed:
Control how chunks behave:
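Chunk options are written as special `#|` comments at the top of a chunk. The example below shows the contents of a chunk that, in a Quarto document, would sit inside an `{r}` code fence. The label and the calculation are illustrative; `label`, `echo` and `warning` are standard Quarto options.

```r
#| label: seat-share
#| echo: false
#| warning: false

# With echo: false, the rendered document shows the result but not this code.
seats <- c(Labor = 77, Liberal = 48)  # totals quoted later in these slides
seat_share <- seats / 151             # share of the 151 lower-house seats
round(seat_share, 2)
```

Because `#|` lines are ordinary comments to R, the same code also runs unchanged in the console.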
Let’s walk through the complete workflow with a real example:
Question: How many seats did each party win in the 2022 Australian Federal Election?
The workflow
Data we need:
| Division | Party |
|---|---|
| Adelaide | Labor |
| Aston | Liberal |
| … | … |
Graph we want:
A bar chart showing the number of seats won by each party.
# A tibble: 6 × 2
  division party
     <int> <chr>
1        1 Liberal
2        2 Liberal
3        3 Greens
4        4 Other
5        5 Liberal
6        6 Nationals
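Output of this shape could come from a simulation along the following lines. This is a sketch rather than the exact code behind the slide: it uses a base R data frame instead of a tibble so it runs without extra packages, and the seed and the list of party labels are assumptions, so the rows drawn will differ from those shown above.

```r
set.seed(853)  # arbitrary seed so the draw is repeatable

# One row per division, with a party drawn at random for each.
simulated_data <- data.frame(
  division = 1:151,
  party = sample(
    x = c("Liberal", "Labor", "Nationals", "Greens", "Other"),
    size = 151,
    replace = TRUE
  )
)

head(simulated_data)
```

The simulated table has the same two columns as the planned dataset, so code written against it should carry over to the real data.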
Australia is a parliamentary democracy with 151 seats in the House of Representatives. The 2022 Federal Election saw the Labor Party win 77 seats, followed by the Liberal Party with 48 seats.
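The tally and bar chart behind a summary like this take only a few lines. The sketch below is not the real election dataset: it rebuilds a stand-in table from the two totals quoted above and lumps the remaining 151 - 77 - 48 = 26 seats into a placeholder "All other" category.

```r
# Stand-in for the cleaned data: one row per division.
# Only the Labor and Liberal totals come from the slides.
elected <- data.frame(
  division = 1:151,
  party = rep(
    c("Labor", "Liberal", "All other"),
    times = c(77, 48, 151 - 77 - 48)
  )
)

# Count seats per party, largest first.
seat_counts <- sort(table(elected$party), decreasing = TRUE)
seat_counts

# The planned graph: number of seats won by each party.
barplot(seat_counts, xlab = "Party", ylab = "Number of seats")
```

With the real data, the `party` column would have one value per division, and the same two steps would apply unchanged.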
Key findings:
Telling Stories with Data:
Regression and Other Stories:
Reading strategy
Focus on the concepts first. The technical details will make more sense as we practise.
Week 2: Reproducible Workflows and Version Control
Before next week
Office hours:
Email:
Resources: